xia2.multiplex: a multi-crystal data-analysis pipeline

A new program, xia2.multiplex, has been developed to facilitate symmetry analysis, scaling and merging of multi-crystal data sets.

In macromolecular crystallography, radiation damage limits the amount of data that can be collected from a single crystal. It is often necessary to merge data sets from multiple crystals; for example, small-wedge data collections from microcrystals, in situ room-temperature data collections and data collection from membrane proteins in lipidic mesophases. Whilst the indexing and integration of individual data sets may be relatively straightforward with existing software, merging multiple data sets from small wedges presents new challenges. The identification of a consensus symmetry can be problematic, particularly in the presence of a potential indexing ambiguity. Furthermore, the presence of nonisomorphous or poor-quality data sets may reduce the overall quality of the final merged data set. To facilitate and help to optimize the scaling and merging of multiple data sets, a new program, xia2.multiplex, has been developed which takes data sets individually integrated with DIALS and performs symmetry analysis, scaling and merging of multi-crystal data sets. xia2.multiplex also performs analysis of various pathologies that typically affect multi-crystal data sets, including non-isomorphism, radiation damage and preferential orientation. After the description of a number of use cases, the benefit of xia2.multiplex is demonstrated within a wider autoprocessing framework in facilitating a multicrystal experiment collected as part of in situ room-temperature fragmentscreening experiments on the SARS-CoV-2 main protease.

Introduction
Macromolecular structure determination routinely uses data sets obtained under cryogenic conditions from a single crystal. However, radiation damage limits the amount of data that can be collected from a single crystal. Cryocooling vastly increases the dose that can be tolerated by a single crystal, leading to the dominance of cryo-crystallography in macromolecular structure determination (Garman, 1999;Garman & Owen, 2007). However, it is often still necessary to merge multiple data sets from one or more crystals when dealing with radiationsensitive samples and high-brilliance X-ray beams from thirdgeneration light sources.
Multi-crystal data collection dates back to the early days of macromolecular crystallography (Kendrew et al., 1960;Clemons et al., 2001), but has seen a resurgence in recent years (Yamamoto et al., 2017) as many scientifically important targets, such as membrane proteins and viruses, frequently yield small, weakly diffracting microcrystals. The development of crystallization in lipidic mesophases (Caffrey, 2003(Caffrey, , 2015 and the availability of microfocus beamlines Smith et al., 2012) have facilitated data collection and structure solution of these difficult targets. Data-collection strategies for small weakly diffracting crystals rely on the collection of many small wedges of data, typically 5-10 per crystal, at cryogenic temperatures. For samples in the lipidic mesophase this is often preceded by X-ray raster scanning to identify the locations of crystals (Cherezov et al., 2007(Cherezov et al., , 2009Rasmussen et al., 2011;Rosenbaum et al., 2011;Warren et al., 2013). Such experiments are becoming increasingly automated thanks to developments such as MeshAndCollect (Zander et al., 2015) and ZOO (Hirata et al., 2019).
Multi-crystal data collections have also been applied to experimental phasing, where combining data from multiple crystals enhances weak anomalous signals, providing highmultiplicity data of sufficient quality to enable structure solution by single-wavelength anomalous dispersion (SAD; Liu et al., 2011;Liu & Hendrickson, 2015) and sulfur SAD (S-SAD; Akey et al., 2014;Liu et al., 2014;Huang et al., 2015Huang et al., , 2016Olieric et al., 2016).
Although cryogenic structures have provided the gold standard for structural analysis of macromolecules for decades, it has been shown that cryocooling can hide biologically significant structural features (Fraser et al., 2009(Fraser et al., , 2011Fischer et al., 2015). Certain classes of macromolecular crystals, such as viruses, can also suffer when cryocooled. However, room-temperature data collection presents its own challenges, namely that radiation damage occurs at an absorbed dose one to two orders of magnitude lower than at cryogenic temperatures (Helliwell, 1988;Nave & Garman, 2005). In contrast to cryogenic data collections, an inverse dose-rate effect on crystal lifetime has been observed in roomtemperature data (Southworth-Davies et al., 2007). As a result, obtaining a complete room-temperature data set from a single crystal is difficult, so combining data from multiple crystals becomes necessary.
As the demand for room-temperature methods has increased, beamline developments have enabled routine room-temperature data collection directly from crystals in crystallization plates (in situ). This has the added benefit of eliminating the need for crystal harvesting (Axford et al., 2012Aller et al., 2015), and a beamline, VMXi at Diamond Light Source, now exists that is dedicated to in situ data collection (Sanchez-Weatherby et al., 2019). Advances in beamline and detector technology have enabled the collection of room-temperature data at a higher dose rate (Owen et al., 2012(Owen et al., , 2014Schubert et al., 2016), increasing the general applicability of room-temperature data collection (Aller et al., 2015;Broecker et al., 2018).
Merging multiple data sets from small wedges presents a number of challenges. For novel structures with unknown space group and unit-cell parameters, identifying a consensus symmetry can be problematic, particularly in the presence of indexing ambiguities (Brehm & Diederichs, 2014;Kabsch, 2014;. The presence of non-isomorphous or poor-quality data sets may also degrade the overall quality of the merged data set. Various methods have been developed to identify individual non-isomorphous data sets based on the comparison of unit-cell parameters (Foadi et al., 2013;Zeldin et al., 2015) or intensities (Giordano et al., 2012;Santoni et al., 2017;Diederichs, 2017) in order to combat this. Rogue data sets, or even individual bad images, can be identified by algorithms such as the ÁCC 1/2 method described by Assmann et al. (2016) and implemented within dials.scale (Beilsten-Edmands et al., 2020).
Microcrystal and room-temperature data-collection strategies are a compromise between maximizing the useful signal and minimizing the effects of radiation damage. By analysing manifestations of radiation damage, we can provide rapid feedback to guide an ongoing experiment and truncate the number of images used to produce the best final composite data set. The R cp statistic introduced by Winter et al. (2019) can also be applied to multi-crystal data, under the assumption that the dose per image is approximately constant for all data sets. This may be appropriate for multi-crystal data collections where approximately uniformly sized crystals are bathed in the X-ray beam.
Preferential orientation of crystals can be a concern for some multi-crystal data collections, depending on the crystal symmetry and morphology, such as plate-like crystals in situ within a flat-bottomed crystallization well. Preferential orientation can lead to under-sampled regions of reciprocal space with systematically low-multiplicity or missing reflections, which may have adverse consequences on downstream phasing or refinement. Providing feedback on preferential orientation provides the opportunity for a user to make modifications to their experiment to minimize any resulting issues, for example by fully exploiting the available experimental geometry or changing the crystallization conditions or platform (Maeki et al., 2016).
Structural biologists have become accustomed to the highly automated data analysis provided by synchrotron beamlines around the world (Holton & Alber, 2004;Winter, 2010;Vonrhein et al., 2011;Winter & McAuley, 2011;Winter et al., 2013;Monaco et al., 2013;Yamashita et al., 2018), typically obtaining automated data-processing results within minutes of the end of data collection for routine experiments. Multicrystal experiments can generate large volumes of data in minutes, which brings new challenges in terms of bookkeeping and data analysis.
There are two primary aspects in which automated data analysis can support multi-crystal experiments. Firstly, rapid feedback from data analysis during beamtime can help to guide ongoing experiments, enabling a more efficient use of beamtime and allowing a user to more selectively screen sample conditions. Relevant feedback may include suitable metrics on merged data quality, i.e. completeness, multiplicity and resolution, and feedback on experimental pathologies, such as non-isomorphism, radiation damage and preferential orientation, that may hinder the experimental goals.
Secondly, after the completion of beamtime the user may be prepared to invest more time and effort in interactively optimizing the best overall data set for any given sample group. Automation is still highly relevant in this context, as the user may have collected data on many sample groups which they wish to process in a similar manner.
Standard autoprocessing pipelines such as xia2 (Winter, 2010) can handle multi-crystal data sets to some extent. However, they are optimized to process a small number of relatively complete data sets, rather than the many tens to hundreds of severely incomplete data sets that comprise a multi-crystal experiment. Recent software developments, for example KAMO (Yamashita et al., 2018), have focused on automating the data processing of multi-crystal experiments.
Here, we present a new program, xia2.multiplex, which has been developed to facilitate the scaling and merging of multiple data sets. It takes data sets individually integrated with DIALS as input and performs symmetry analysis, scaling and merging, and analyses the various pathologies that typically affect multi-crystal data sets, including nonisomorphism, radiation damage and preferential orientation.
Using data sets collected as part of in situ room-temperature fragmentscreening experiments on the SARS-CoV-2 main protease (M pro ), we demonstrate the use of xia2.multiplex within a wider autoprocessing framework to give rapid feedback during a multi-crystal experiment, and how the program can be used to further improve the quality of the final merged data set.

Methods
Prior to using xia2.multiplex, each data set should be processed individually with DIALS . Data may be processed either in the primitive P1 setting, or alternatively Bravais symmetry may be determined prior to integration using dials.refine_bravais_ settings. It is not necessary to individually scale the data at this point.
Preliminary filtering of data sets is performed using hierarchical unit-cell clustering methods (Zeldin et al., 2015). Histograms and scatterplots of the unitcell distribution are generated for visual analysis, after which symmetry analysis and indexing-ambiguity resolution are performed with dials.cosym. Finally, the data are scaled with dials.scale, followed by radiation-damage and isomorphism analysis. The main sequence of steps taken by xia2.multiplex is outlined in Fig. 1.

Symmetry analysis
Initial analysis of the Patterson symmetry of the data is performed using dials.cosym . This is an extension of the methods of Brehm & Diederichs (2014) for resolving indexing ambiguities in partial data sets and for completeness is reviewed here.
The maximum possible lattice symmetry compatible with the averaged unit cell is used to compile a list of all potential symmetry operations. The matrix of pairwise correlation coefficients is constructed, of size (n Â m) 2 , where n is the number of data sets and m is the number of symmetry operations in the lattice group. The Pearson correlation coefficient between data sets i and j, after the application of the kth and lth symmetry operators respectively, is defined according to where I i k ðhÞ is the scaled intensity for data set i of the reflection with Miller index h after application of the kth symmetry operator.
Similarly to Brehm & Diederichs (2014), correlation coefficients are only calculated for pairs of data sets with three or more reflections in common. If a pair of data sets have two or fewer common reflections, then the correlation coefficient for that pair is assumed to be zero. The minimum number of common reflections required for the calculation of correlation coefficients is configurable in dials.cosym and xia2.multiplex.
Each data set is represented as n Â m coordinates in an m-dimensional space. Use of an m-dimensional space allows the presence of up to m orthogonal x i clusters, where the orthogonality between clusters corresponds to a correlation coefficient r i k ;j l close to zero. A modification of algorithm 2 of Brehm & Diederichs (2014), accounting for the additional symmetry-related copies of each data set, is used to iteratively minimize the function using the L-BFGS minimization algorithm (Liu & Nocedal, 1989), with randomly assigned starting coordinates x in the range 0-1. 2.1.1. Determination of the number of dimensions. It is necessary to use a sufficient number of dimensions to represent any systematic variation that is present between data sets. Using m-dimensional space, where m is equal to the number of symmetry operations in the maximum possible lattice symmetry, should be sufficient to represent any systematic variation present due to pseudosymmetry. However, choosing the optimal number of dimensions is a balance between underfitting and overfitting. Using more dimensions than is strictly necessary may reduce the stability of the minimization, particularly in the case of sparse data, where there is minimal overlap between data sets. As a result, we devised the following procedure to automatically determine the necessary number of dimensions.
(i) For each dimension in the range 2-m minimize equation (2) and record the final value of the function.
(ii) Plot the resulting values as a function of the number of dimensions.
(iii) Determine the 'elbow' point of the plot, in a similar manner to that used by Zhang et al. (2006), to give the optimal number of dimensions.
Alternatively, the user may specify the number of dimensions to be used for the analysis.

Identification of symmetry.
Patterson group symmetry is determined using an algorithm heavily influenced by the program POINTLESS (Evans, 2006(Evans, , 2011. Evans (2011) estimates the likelihood of a symmetry element S k being present, given the correlation coefficient CC k , as The probability of observing the correlation coefficient CC k if the symmetry is present, p(CC k ; S k ), is modelled as a truncated Lorentzian centred on the expected value of CC if the symmetry is present, E(CC; S), with a width parameter = (CC k ).
The distribution of CC k if the symmetry is not present is modelled as Diederichs (2017) makes clear that the relationship between the results of the clustering procedure outlined above and the correlation coefficient r ij between two data sets i and j is The lengths of the vectors |x i | are inversely related to the amount of random error, i.e. they provide an estimate of CC*. The maximum possible correlation coefficient between two data sets is given by the product of their CC* values. The angles between two vectors represent genuine systematic differences. For points related by genuine symmetry operations we expect cos[/(x i , x j )] ' 1, whereas for points related by symmetry operations that are not present we expect We can therefore use cos[/(x i , x j )] in place of CC k , with E(CC; S) = 1. The estimated error (CC k ) used by Evans (2011) has a lower bound of 0.1, which is intended to avoid very small values of (CC k ) when large numbers of reflections contribute to the calculation of CC k . Since many reflections are contributing indirectly to the angles between any one pair of vectors, we can assume a value for the truncated Lorentzian width parameter of = (CC k ) = 0.1. The average of all observations of cos[/(x i , x j )] corresponding to a given symmetry operator S k is used as an estimate of CC k .
Once a score has been assigned to each potential symmetry operator, all possible point groups compatible with the lattice group are scored as in Appendix A2 of Evans (2011), (i) Find the highest lattice symmetry compatible with the unit-cell dimensions.
(ii) Score each potential rotation operation using all reflections related by that operation.
(iii) Score possible subgroups (Patterson groups) according to combinations of symmetry elements.
Once the most likely Patterson group has been identified by the above procedure, it is then relatively straightforward to assign a suitable re-indexing operation to each data set to ensure that all data sets are consistently indexed. Firstly, a high-density point is chosen as a seed for the cluster. Then, for each data set, the nearest symmetry copy of that data set to the seed is identified. The symmetry operation corresponding to this symmetry copy is then the re-indexing operation for this data set.

Unit-cell refinement
After symmetry determination, an overall best estimate of the unit cell is obtained by refinement of the unit-cell parameters against the observed 2 angles using dials.two_ theta_refine (Winter et al., 2022). This program minimizes the unit-cell constants against the difference between observed and calculated 2 values, which are determined from background-subtracted integrated centroids. This provides an overall best estimate of the unit cell that is a suitable representative average for use in subsequent downstream phasing and refinement.

Scaling
Data are then scaled using the physical scaling model in dials.scale (Beilsten-Edmands et al., 2020). xia2.multiplex uses the automatic scaling-model selection within dials.scale to enable a suitable model parameterization for both the cases of small-wedge data sets and large-wedge data sets. For smallwedge data sets, each data set is corrected by an overall scale factor and relative B factor that vary smoothly as a function of rotation angle, whereas the absorption correction of the physical scaling model is not used as this correction requires the sampling of a diverse set of scattering paths through the sample. For large-wedge data sets, the absorption correction of the physical scaling model is used in addition to the smoothly varying scale and B-factor corrections. The strength of the absorption correction can optionally be set to low (the default), medium or high. This option adjusts the absorption model parameterization and restraints to enable a correction that more closely matches the expected relative absorption, which can be high at long wavelengths or for crystals containing heavy atoms.
Several rounds of outlier rejection are performed during scaling to remove individual reflections that have poor agreement with their symmetry equivalents. The uncertainties of the intensities are also adjusted during scaling by optimizing a single error model across all data sets in order to account for the effects of systematic errors, which tend to increase the variability of intensities within each symmetry-equivalent group. Optionally, for anomalous data, Friedel pairs can be treated separately in scaling, which can increase the strength of the detected anomalous signal.

Estimation of resolution cutoff
After the data have successfully been scaled, dials.estimate_ resolution is used to estimate a suitable resolution cutoff for the data. By default, this is determined from a fit of a hyperbolic tangent to CC 1/2 calculated in resolution bins, similar to that used by AIMLESS (Evans & Murshudov, 2013). The resolution cutoff is chosen as the resolution where the fit curve reaches CC 1/2 = 0.3 (this cutoff value can be controlled by the user). A second round of scaling with dials.scale is then performed after application of the resolution cutoff. The default cutoff value of CC 1/2 = 0.3 is chosen as one that works well in the context of autoprocessing in order to provide a consistent set of merging statistics for judging data quality during an ongoing experiment. Suitable cutoff values may depend on the downstream data-processing requirements, but the current gold standard for publication is to use 'paired refinement' to determine the resolution at which including higher resolution data in refinement no longer improves the model (Karplus & Diederichs, 2012).

Space-group identification
After the data have been scaled in the Patterson group identified by dials.cosym (Section 2.1), analysis of potential systematic absences is performed by dials.symmetry in order to assign a final space group. In this analysis, the existence of each potential screw axis allowed by the Patterson group is tested by calculating the z-score based on the deviation from zero of the merged hI/(I)i for the expected absent reflections. From the individual z-scores, a likelihood of the presence of each screw axis is determined; these are combined to score and select the most likely non-enantiogenic space group.

Analysis of radiation-damage indicators
xia2.multiplex performs a number of analyses that can be useful in assessing the extent of any radiation damage which may be present. Plots of scale factor and R merge versus image number are generated to look for any trends associated with radiation damage. The R cp statistic introduced by Winter et al.
(2019) can also be applied to multi-crystal data. This statistic accumulates the pairwise measured intensity differences as a function of dose (or image number). In the absence of accurate dose information for each data set, it is necessary to make the assumption that the dose per image is approximately constant for all data sets. In order to assess how many images per crystal are necessary to achieve a complete data set, a plot of completeness versus dose is also generated.

Isomorphism analysis
Unit-cell clustering, as implemented in BLEND (Foadi et al., 2013) and elsewhere (Zeldin et al., 2015), is used by xia2.multiplex as a preliminary filtering step to reject any highly non-isomorphous data sets.
xia2.multiplex implements two alternative intensity-based clustering methods that are suitable for the identification and analysis of non-isomorphism, once symmetry determination, research papers resolution of indexing ambiguities and scaling have been carried out as described above. Clustering on correlation coefficients (Giordano et al., 2012;Santoni et al., 2017;Yamashita et al., 2018) begins by calculating a matrix of pairwise correlation coefficients: A distance matrix defined as d i,j = 1 À r i,j is provided as input to the SciPy (Virtanen et al., 2020) hierarchical clustering routine using the average linkage method. Clusters are sorted by distance, and the completeness and multiplicity of each cluster are reported. Optionally, xia2.multiplex can scale and merge the data sets defined by each cluster that meet userdefined criteria for minimum completeness or multiplicity. A second intensity-based clustering method follows that described by Diederichs (2017), who demonstrated that the methods of Brehm & Diederichs (2014) could be generalized to search for any systematic differences between data sets, not just those caused by an indexing ambiguity. In addition to its use for identifying the Patterson symmetry (Section 2.1), dials.cosym can also be used for analysis of non-isomorphism. In this mode, rather than searching for the presence of potential additional symmetry operators, the matrix of pairwise correlation coefficients of size n 2 reduces to equation (7). The function defined by equation (2) is minimized as before to obtain a representation of the similarity between data sets in a reduced dimensional space.
As made clear by Diederichs (2017), the length of a vector x i is inversely proportional to the random error in data set X i . The angle between vectors x i and x j corresponds to the level of systematic error between data sets X i and X j , and thus can be used to estimate the degree of non-isomorphism between these data sets. Analysis of the angular separation of vectors x can be used to identify groups of systematically different data sets. Hierarchical clustering on the cosines of the angles between vectors is performed to identify possible groupings of data sets for further investigation. Optionally, xia2.multiplex can rescale multiple subsets of data, which can be controlled by specifying a maximum number of clusters to merge and/or the minimum required completeness or multiplicity for a cluster.
The final approach to isomorphism analysis implemented within xia2.multiplex is the ÁCC 1/2 method described by Assmann et al. (2016) and implemented within dials.scale (Beilsten-Edmands et al., 2020). If ÁCC 1/2 filtering is selected then xia2.multiplex will perform additional scaling with dials.scale, rejecting any data sets that are identified as significant outliers according to ÁCC 1/2 analysis. Whilst this approach may not be suitable if there are two or more significant non-isomorphous populations, it may give useful results if there are a small number of data sets that are systematically different from the majority.

Preferential orientation
The report generated by xia2.multiplex includes stereographic projections of the crystal orientation relative to the laboratory frame generated with dials.stereographic_projection. A random distribution of points (each point corresponds to a crystal or its symmetry equivalent) in a stereographic projection suggests a random distribution of crystal orientations, whereas a systematic nonrandom distribution may be indicative of preferential crystal orientation.
xia2.multiplex also generates a number of plots that can aid in the analysis of the distribution of multiplicities.
A new command, dials.missing_reflections, has been developed to identify connected regions of missing reflections in reciprocal space. Prior to performing the analysis, it is necessary to map centred unit cells to the primitive setting in order to avoid systematically absent reflections complicating the analysis. The complete set of possible Miller indices is generated and expanded to cover the full sphere of reciprocal space by the application of symmetry operators belonging to the known space group. This allows the identification of connected regions that cross the boundary of the asymmetric unit. Nearest-neighbour analysis is used to construct a graph of connected regions, which is then used to perform connected components analysis to identify each connected region of missing reflections. Miller indices for missing reflections are then mapped back to the asymmetric unit in order to identify the set of unique Miller indices belonging to each region. A sorted list of connected regions is reported to the user, detailing the resolution range spanned by each region and the number and proportion of total reflections comprising each region.
3. Deployment of xia2.multiplex at Diamond Light Source xia2.multiplex, as described above, has been deployed as part of the autoprocessing pipeline at Diamond Light Source. A series of partial data sets are collected from a set of related crystals, for example from multiple crystals within one or more drops in a crystallization plate (Sanchez-Weatherby et al., 2019), sample loop or sample mesh. After the end of each data collection, the partial data set is processed individually with DIALS via xia2. On the successful completion of xia2, a xia2.multiplex processing job is triggered using all successful xia2 results from this and prior data collections as input. The xia2.multiplex results, including merging statistics, are recorded in ISPyB (Delageniè re et al., 2011) for presentation to the user via SynchWeb , where results are typically available within minutes of the end of data collection. Prior to data collection, users may define groups of related samples for combining with xia2.multiplex either via SynchWeb or via a configuration file in a pre-defined location.
In the absence of this information, xia2.multiplex will only combine data collected from the same sample, i.e. loop, mesh or well within a crystallization plate.
If a PDB file has been associated with the data collection, then automated structure refinement is performed research papers Acta Cryst. (2022). D78, 752-769 with DIMPLE using the merged reflections output by xia2.multiplex.

Room-temperature in situ experimental phasing
Using data from Lawrence et al. (2020), we showcase the application of xia2.multiplex to multi-crystal room-temperature in situ data sets from heavy-atom soaks of lysozyme crystals, demonstrating successful experimental phasing using the resulting xia2.multiplex output. Data from lysozyme crystals soaked with six different heavy-atom solutions were processed individually with DIALS via xia2 followed by symmetry determination (Figs. 3a and 3b), scaling and merging with xia2.multiplex. Partial data sets identified as outliers according to ÁCC 1/2 were rejected in an automated iterative process with xia2.multiplex. Data-processing statistics for each heavyatom soak, with and without ÁCC 1/2 filtering of outlier data sets, are shown in Tables 1 and 2. Phasing was performed with fast_ep using SHELXC/D/E (Sheldrick, 2010). Structure refinement was performed by REFMAC5 (Murshudov et al., 2011) via DIMPLE using PDB entry 6qqf (Gotthard et al., 2019) as the reference structure. Anomalous difference maps were calculated by ANODE (Thorn & Sheldrick, 2011) via the --anode option in DIMPLE.
Significant anomalous signal was observed, as indicated in the SHELXC plot of hd 00 /(I)i versus resolution (Fig. 2a). Substructure searches with SHELXD were successful (Fig. 2b), and traceable electron-density maps were obtained by SHELXE. Anomalous difference maps calculated by ANODE via DIMPLE indicated the presence of significant anomalous difference peaks (Figs. 2c and 2d).
To assess the impact of ÁCC 1/2 filtering on the resulting anomalous signal, we performed experimental phasing and structure refinement (via DIMPLE) and calculated anomalous difference maps using data both with and without ÁCC 1/2 filtering of outliers. Substructure solution and autotracing were successful in both cases. ÁCC 1/2 filtering also resulted in improved merging statistics, typically in CC 1/2 , CC anom , hd 00 /(I)i, hI/(I)i and R p.i.m. versus resolution (Tables 1 and  2). For the NaBr and Sm soaks there are particularly significant improvements in R work and R free after ÁCC 1/2 filtering. These two soaks also correspond to the data sets that showed the largest improvement in anomalous difference peak height after the removal of outlier data sets (Fig. 2d).
We note that merging statistics such as correlation coefficients and R factors, which are calculated only on the unmerged intensity values without taking into account their errors, can be affected by regions of lower data quality that are suitably down-weighted with larger errors during scaling. The presence of these regions, however, does not adversely affect the resulting merged intensities, which are appropriately weighted. This disparity is most likely to be evident for highmultiplicity data with regions of significant radiation damage, in which case merged data-quality indicators are most representative of the data quality.
As outlined in Section 2.5, several different methods are available in xia2.multiplex for identifying outlier data sets. Above, we used ÁCC 1/2 filtering to identify and exclude outlier partial data sets. Visualization of the distribution and  Table 1 Data-collection, merging and refinement statistics for lysozyme room-temperature in situ heavy-atom soaks using all data sets. hierarchical clustering on unit-cell parameters for the Sm soak (Figs. 3e and 3f ) identifies data set 11 as an outlier, which was also the first data set to be excluded by ÁCC 1/2 filtering. Similarly, hierarchical clustering on pairwise correlation coefficients (Fig. 4a) and on the cosines of the angles between vectors x (Figs. 3c, 3d and 4b) both identify data set 11 as an outlier. Whilst in this case all available methods for isomorphism analysis identified data set 11 as the least compatible data set, it is beneficial to have an array of different methods available, as the best method for a particular system may depend on the nature of any nonisomorphism involved.

TehA
Previously published in situ data for Haemophilus influenzae TehA  were used to further demonstrate the applicability of xia2.multiplex and the tools contained therein. 73 partial data sets were processed individually with DIALS via xia2, providing no prior space group or unit-cell information. 71 successfully integrated data sets were provided as input to xia2.multiplex, where data were combined and scaled using dials.cosym and dials.scale. Two data sets were identified as having inconsistent unit cells by preliminary filtering and were removed, leaving 69 data sets for subsequent symmetry analysis and scaling. Structure refinement was performed by REFMAC5 via DIMPLE. Data-processing and refinement statistics using all data and only those remaining after filtering by ÁCC 1/2 are shown in Table 3.
The maximum possible lattice symmetry was determined to be RÀ3m:H, with a maximum of six symmetry operations. Analysis of the value given by equation (2) as a function of the number of dimensions identified that two dimensions were sufficient to explain the variation between data sets. Further symmetry analysis with dials.cosym correctly identified the Patterson group as RÀ3:H, resolving the indexing ambiguity present in this space group (Figs. 5a and 5b).
The best overall unit cell was determined by dials.two_ theta_refine as a = b = 98.76, c = 136.77 Å , and data were scaled together with dials.scale. Resolution analysis with dials.estimate_resolution identified 2.14 Å as the resolution where the fit of a hyperbolic tangent to CC 1/2 ' 0.3.
Six cycles of scaling and filtering were performed by dials.scale, where exclusion was performed on whole data sets. A single outlier data set (using a cutoff of 3) was removed in each of the first five cycles, removing a total of 6.2% of the reflections. No significant outliers were identified in the sixth and final cycle.
Structure refinement was performed by REFMAC5 via DIMPLE with the model from PDB entry 4ycr , using all scaled data and after filtering of outliers using the ÁCC 1/2 method. Filtering of outlier data sets leads to a slight improvement in the merging statistics, particularly in hI/(I)i and R p.i.m. . There is also a slight reduction in the R work and R free reported by REFMAC5.
Stereographic projections of crystal orientations with dials.stereographic_projection shows that preferential crystal orientatation may be an issue for this experiment ( Table 2 Data-collection, merging and refinement statistics for lysozyme room-temperature in situ heavy-atom soaks after the removal of data sets identified by ÁCC 1/2 analysis. 5d). Fig. 5(e) and 5( f ) show the consequences that this has on the distribution of multiplicities in the resulting data set. Analysis with dials.missing_reflections identifies a single region of missing reflections, comprising 1390 reflections (5.2%) covering the range 53.41-2.14 Å .
Fragment-screening experiments such as these are typically carried out using conventional cryogenic conditions to minimize the effects of radiation damage, with each structure being obtained from a single crystal. Room-temperature data, however, can usefully identify or rule out structural artefacts induced by pushing the temperature far from the biologically relevant level Guven et al., 2021).
Over the course of several beamline visits, room-temperature in situ data were collected for 30 ligand soaks that had previously shown ligand binding under cryogenic conditions. Here, we highlight room-temperature data collections for five ligand soaks that showed evidence of ligand binding at room temperature: Z1367324110 (PDB entry 5r81) and Z31792168 (PDB entry 5r84) (Douangamath et al., 2020), Z4439011520 (PDB entry 5rh5), Z4439011584 (PDB entry 5rh7) and ABT-957 (PDB entry 7aeh) (Redhead et al., 2021).
Data were collected on beamline I24 at Diamond Light Source with a Dectris PILATUS 3 6M detector using a 30 Â  dials.cosym plots for data from lysozyme Sm soaks as described in Section 4.1. (a) Histogram of (n Â m) 2 pairwise R ij correlation coefficients and (b) the (n Â m) vectors x determined by the minimization of equation (2) during symmetry determination with dials.cosym. The R ij correlation coefficients are clustered towards 1 and the majority of the vectors x form a single cluster, suggesting the absence of an indexing ambiguity, i.e. the Patterson group of the data set corresponds to the maximum lattice symmetry. (c, d) As above but after symmetry determination and scaling. The distribution of the n 2 R ij correlation coefficients is sharpened towards 1 as scaling improves the internal consistency of the data. There is also an effect from multiplicity when comparing with (a), as here the n 2 R ij values are calculated in the highest symmetry group for the lattice. All but one of the n vectors x form a tight cluster, with the vector lengths close to 1. Visualization of (e) the distribution of unit-cell parameters and ( f ) clustering on unit-cell parameters suggests the presence of an outlier data set. the constraints imposed by the experimental setup. Based on typical crystal dimensions of 50 Â 50 Â 5 mm, the X-ray dose per data collection was estimated to be in the range 50-67 kGy using RADDOSE-3D (Zeldin et al., 2013;Bury et al., 2018). RADDOSE-3D input and output files are included in the supporting information.
As described in Section 3, data sets were automatically processed individually with DIALS via xia2, followed by combined scaling and merging after each data collection with xia2.multiplex. Automatic structure refinement and difference map calculations were performed using DIMPLE.
410 data sets were collected in a single visit at a maximum throughput of 46 data sets per hour. The median time from the end of data collection to the completion of the associated processing job was 222.5 and 352 s for xia2.multiplex and DIMPLE, respectively. 98% of DIMPLE results were reported within 10 min of data collection finishing (see also Supplementary Fig. S1).
Figs. 6(a)-6(c) show the improvement in the merging statistics for the autoprocessed data on the addition of each new data set. There is a visible improvement in the quality of the DIMPLE electron-density map with the number of crystals .
Analysis of the distribution of unit-cell parameters and clustering on unit-cell parameters indicated the presence of potential outlier data sets (Figs. 7a and 7b). Reprocessing with a lower unit-cell clustering threshold resulted in improved merging statistics for some data sets (Figs. 7e and 7f). Alternatively, ÁCC 1/2 analysis may be useful in identifying outlier data sets. For ligand soak Z4439011520, ÁCC 1/2 analysis by dials.scale identified two outlier data sets over two rounds of scaling and filtering (Figs. 7c and 7d). ÁCC 1/2 filtering removed data sets 0 and 18, which were also the two least compatible data sets identified by unit-cell clustering, although only the latter was identified as an outlier according to the chosen unit-cell clustering threshold.
Using the data improved by the rejection of outlier data sets as above, initial structure solution was performed using MOLREP (Vagin & Teplyakov, 2010) with PDB entry 7aeh as the search model. Structures were refined for 200 cycles in REFMAC5 using rigid-body refinement, followed by iterative rounds of restrained refinement with automatic TLS and assisted model building in Coot (  Hierarchical clustering (a) on pairwise correlation coefficients and (b) on the cosines of the angles between vectors in Fig. 3(d) identify the presence of an outlier data set.   data-processing and refinement statistics for five ligand soaks, Z1367324110, Z31792168, Z4439011520, Z4439011584 and ABT-957, are reported in  Incremental processing with xia2.multiplex and DIMPLE of in situ data collections of SARS-CoV-2 M pro ligand soak Z4439011520. (a, b) CC 1/2 and R p.i.m. data-processing statistics for ligand Z4439011520 with the inclusion of progressively more data sets in data-collection order from top left to bottom right. (c, d) Overall data completeness and gemmi (https://gemmi.readthedocs.io) blob search scores. (e, f, g) The ligand density in the autoprocessed DIMPLE maps for two, nine and 20 crystals, respectively. All contours are drawn at 3. for the cryo-structure with this ligand (Redhead et al., 2021). Autoprocessing (including both xia2 and xia2.multiplex) was performed both using the user-specified target space group, C2, and with automatic space-group determination. Out of 42 data sets collected, 18 data sets were successfully autoprocessed with DIALS via xia2 in the target space group C2 and combined with xia2.multiplex. In contrast, all 42 data sets individually processed successfully with automatic spacegroup determination in a mixture of space groups P1, P2, P2 1 and C2. 33 data sets remained after filtering for inconsistent unit cells. Analysis of symmetry with dials.cosym identified the Patterson group P2/m, which features an indexing ambiguity due to the approximate pseudo-symmetry of the supergroup C2 (Tables 5 and 6).
Of the ligand-soaked structures obtained, all showed a nearidentical binding conformation in the cryogenic and roomtemperature structures. A minor difference was observed in the conformation of ABT-957, with the C9-N-C1(R) amide bond in the room-temperature structure being flipped compared with the cryogenic structure (Fig. 8). This amide flip had a knock-on effect on the rotomer of the -lactam ring and the benzylic side chain which stems from N1 of the -lactam.
Inspection of a plot of R cp versus image number (Supplementary Fig. S2) showed slight signs of radiation damage for some ligand soaks. Whilst limiting the number of images used from each data set may lead to improvements in some merging statistics ( Supplementary Fig. S3), at the cost of completeness and multiplicity, this did not lead to any appreciable difference in the ligand density in the final structures ( Supplementary  Fig. S4).

Conclusions
xia2.multiplex has been developed to perform symmetry analysis, scaling and merging of multiple data sets. It is distributed with DIALS and hence CCP4, and is available as part of the autoprocessing pipelines across the MX beamlines at Diamond Light Source, including integration with downstream phasing pipelines such as DIMPLE and Big EP. It is capable of providing near real-time feedback on data quality and completeness during ongoing multi-crystal data collec-  Table 4 Data-collection, merging and refinement statistics for SARS-CoV-2 M pro in situ data sets after filtering of outliers according to ÁCC 1/2 .

Figure 7
Outlier identification and removal for SARS-CoV-2 M pro ligand soak Z4439011520. Visualization of (a) the distribution of unit-cell parameters and (b) clustering on unit-cell parameters may suggest possible outlier data sets. (c, d) ÁCC 1/2 filtering with dials.scale can also remove data sets that strongly disagree with the majority of data sets. (e, f ) Removing outlier data sets can improve the overall merging statistics. tions, and can be used as part of an iterative workflow to obtain the best possible final data set after an experiment. We have demonstrated its applicability using two previously published room-temperature in situ multi-crystal data sets, including an example of experimental phasing. Using data sets collected as part of in situ room-temperature fragmentscreening experiments on SARS-CoV-2 M pro , we have shown the ability of xia2.multiplex to provide rapid feedback during multi-crystal experiments, including the identification of an unexpected change in space group on ligand addition.
Remaining challenges include the automatic identification of the best subset(s) of data to use for downstream analyses, and providing a user interface via applications such as SynchWeb or CCP4 to view results and facilitate an interactive workflow using xia2.multiplex. Support for MTZ files as input is planned in order to enable running xia2.multiplex on the output of other data-processing software such as XDS (Kabsch, 2010) and MOSFLM (Battye et al., 2011).